This application is a continuation of prior co-pending application 09/052,591 
and the remarks relate to the final office action dated May 19, 2000 in that case. 

The Examiner rejected claim 1 as being indefinite. Claim 1 has been amended 
to recite "where said difference image used for said fitting is free from being transformed as a 
result of step (d)" as suggested by the Examiner. 

The Examiner rejected claim 10 as being indefinite as related to "non-temporal". 
Claim 10 has been amended to remove the non-temporal limitation. 

The Examiner rejected claims 1-4, 6, 7, and 11 under 35 U.S.C. Section 103(a) 
as being unpatentable over Rabiner et al. (article titled "Object Tracking Using Motion- 
Adaptive Modeling of Scene Content") in view of McLaughlin (article entitled "Randomized 
Hough Transform: Better Ellipse Detection"). Rabiner et al. describe a face location detection 
system that uses a motion-based segmentation procedure to differentiate the background from 
the foreground objects. A combined motion-and-edge image is created by overlaying a 
decimated edge image onto a globally compensated decimated difference image (which is 
referred to as motion data). From this a ternary motion-and-edge image is created where the 
pixels can have one of the values b0, b1, and b2. Edge data pixels are set to b2, motion data 
pixels are set to b1, and remaining pixels are set to b0. Finally, data areas classified as 
background are erased in order to create a foreground motion-and-edge image, as shown in FIG. 
3 of Rabiner et al. 
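
The construction described above can be sketched in a few lines. The following Python is only an illustrative sketch and is not taken from Rabiner et al.: the numeric stand-ins for b0, b1, and b2, the function names, and the rule that an edge pixel overwrites a motion pixel where both are present are all assumptions made for the example.

```python
import numpy as np

# Hypothetical numeric stand-ins for the three pixel values b0, b1, b2.
B0, B1, B2 = 0, 1, 2


def ternary_motion_edge(edge_mask, motion_mask):
    """Overlay edge data on motion data to form a ternary image.

    edge_mask and motion_mask are boolean arrays of the same shape.
    Assumption: where both masks are set, the edge value b2 wins.
    """
    out = np.full(edge_mask.shape, B0, dtype=np.uint8)
    out[motion_mask] = B1   # motion data pixels
    out[edge_mask] = B2     # edge data pixels, overlaid on the motion data
    return out


def erase_background(ternary, background_mask):
    """Reset pixels classified as background to b0 (foreground image)."""
    foreground = ternary.copy()
    foreground[background_mask] = B0
    return foreground
```

Any ellipse fitting performed after this step operates on the resulting foreground image rather than on the original difference image.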

The algorithm disclosed by Rabiner et al. then looks for "best elliptical fits" to 
the clumps of motion-and-edge data using this foreground motion-and-edge image, as disclosed 
in section 3.1. Section 3.2 discloses a set of "semantic rules" naturally imposed by the scene 
content. In particular the semantic algorithm keeps track of (i) location, (ii) size and shape, and 
(iii) the number of objects. Based on this semantic information the algorithm adapts the search 
range of the ellipse parameters in the current frame. The purpose of the semantic rules is to 
make the tracking algorithm both temporally adaptive and knowledge-based (section 3.2). 
Based jointly on measure of fitness and on the semantic rules (sections 3.1 and 3.2) the 
algorithm selects the final ellipse. However, Rabiner et al. disclose a system where the actual 
selection of the final ellipse is based on the clumps of motion-and-edge data (i.e., based upon a 
transform of the difference image in a spatial domain to a parameter space), and not based on 
the non-transformed difference image. 

The Examiner noted that "Rabiner does not teach the specifics of determining 
the 'best elliptical fits' based on a transform of the difference image from a spatial domain into 
parameter space" (end of page 7 and start of page 8 of the Office Action dated May 19, 2000 in 
the parent application). More accurately, the applicant would point out that Rabiner explicitly 
teaches determining the actual selection of the final ellipse based on the clumps of motion-and- 
edge data, which is the difference image transformed into the parameter space. It is not merely 
a matter of Rabiner not "teaching the specifics", as asserted by the Examiner, but more 
accurately Rabiner in fact teaches the determination of the best elliptical fits (albeit tersely) in 
the parameter space. 

The Examiner asserts that McLaughlin discloses a method for detecting ellipses 
in an image comprising transforming the input image into Hough parameter space. The 
Examiner further states that it would have been obvious at the time the invention was made to 
one of ordinary skill in the art to utilize the ellipse detection method taught by McLaughlin, in 
order to find candidate elliptical facial outlines as required by Rabiner, thereby being able to 
detect the candidate ellipses. 

In the event that the teachings of McLaughlin are used, as suggested by the 
Examiner, to find candidate elliptical facial outlines as required by Rabiner, this would provide 
more specific details as to how the candidate elliptical facial outlines are determined. 
Accordingly, the McLaughlin technique would be used to determine the actual selection of the 
final ellipse based on the clumps of motion-and-edge data, which is the difference image 
transformed into the parameter space of Rabiner. 

Claim 1 patentably distinguishes over Rabiner et al. in view of McLaughlin by 
claiming the fitting of the plurality of candidate facial regions to the difference image to select 
one of the candidate facial regions, where the difference image used for the fitting is free from 
being transformed as a result of step (d). In contrast, Rabiner et al. disclose using the 
transformed difference image from which to fit the candidate facial regions. Moreover, Rabiner 
et al. in view of McLaughlin would result in a more specific implementation, namely the 
incorporation of the ellipse detection method of McLaughlin using the transformed difference 
image of Rabiner et al. The applicant would respectfully disagree with the Examiner's 
assertion that the limitation "where the difference image used for said fitting is free from being 
transformed as a result of step (d)" is naturally met by the Rabiner/McLaughlin combination 
(office action, page 7, lines 13-15). In addition, the Examiner stated that "using the 
McLaughlin transform, best fitting ellipses are found in the Rabiner difference image" (office 
action, page 7, lines 15-16). The applicant would respectfully note that the best fitting ellipses 
are actually fitted based upon the transformed image in parameter space, not the difference 
image as asserted by the Examiner. 

Claims 2-9 depend from claim 1, either directly or indirectly, and are patentable 
for the same reasons asserted for claim 1. 

The Examiner rejected claims 10, 13, and 14 under 35 U.S.C. Section 102(b) as 
being anticipated by Rabiner et al. 

Rabiner et al. teach that the algorithm uses the previous best-fitting ellipses and 
transforms them using the estimated pan and zoom parameters to obtain a prediction of where 
these objects should be in the current frame if the motion-and-edge data disappears. The 
algorithm therefore tracks the information from the previous frame: (i) location of objects, (ii) 
the size and shape of these objects, and (iii) the number of objects of interest. Based on this 
information, the algorithm adapts the search range of the ellipse parameters in the current 
frame. (Rabiner et al., section 3.2). In essence, section 3.2 of Rabiner et al. teaches the use of 
temporally relevant data for fitting. In addition, section 3.1 of Rabiner et al. teaches the use of a 
fitness metric. However, both the fitness metric (section 3.1) and the semantic rules (section 
3.2) are used for face and body tracking by being applied to the transformed difference images, 
and more particularly, to the foreground motion-and-edge images. 

Claim 10 patentably distinguishes over Rabiner et al. by claiming at least two 
factors of a fit factor, a location factor, and a size factor being fitted to the difference image. 

Claims 11-14 depend from claim 10, either directly or indirectly, and are 
patentable for the same reasons asserted for claim 10. 

The Examiner rejected claims 15-17, 19, 37, 38, 39, 41, 42, 44, 48, 50, 58, and 
59 under 35 U.S.C. Section 103(a) as being unpatentable over Eleftheriadis et al. (U.S. Patent 
No. 5,852,669) in view of Rabiner et al. (article titled "Object Tracking Using Motion-Adaptive 
Modeling of Scene Content") and Stelmach et al. (article titled "Processing Image Sequences 
Based on Eye Movements"). 

Eleftheriadis et al. disclose a system for facial image coding comprising 
determining a facial region from a video and calculating a sensitivity value, as defined by the 
Examiner, for each of plural locations within the video based upon the facial location. The 
sensitivity values disclosed by Eleftheriadis et al. are "quality levels" assigned to different portions of 
the image. The quality level is used to adjust the number of bits used for encoding different 
portions of the image. 

Rabiner et al. disclose a system that includes the use of multiple frames for 
tracking. The Examiner suggests using the detection and tracking system of Rabiner et al. with 
the Eleftheriadis system. 

The Examiner asserts that Stelmach et al. (article entitled "Processing Image 
Sequences Based on Eye Movements") disclose a system that utilizes the human visual 
system's response to calculate a sensitivity value for each of plural spatial locations within an 
image. 

Claims 15 and 37 patentably distinguish over the cited combination because 
there would be no motivation to include a non-linear model of the human visual system's 
ability to perceive image detail at eccentric visual angles with the system taught by Eleftheriadis 
et al. because this would alter the encoding bit allocation of Eleftheriadis from the allocation of 
bits based on the "energy" of the image (e.g. texture) which is a measure of the content of the 
image itself to a completely different bit allocation technique based on the perception of image 
detail at eccentric visual angles. The Examiner has attempted to suggest that the goal of 
"variable compression while maintaining high quality in areas of interest to the viewers" which 

angles of a particular region of the frame of the video, where the particular region of the frame 
is determined based upon the content of the frame itself. Claim 21 further claims encoding the 
frame in a manner that provides a substantially uniform apparent quality of the plurality of 
locations of the frame to the viewer when the viewer is observing the particular region of the 
frame of the video. In contrast, Ruoff, Jr. discloses using an eye tracker 100 to determine the 
particular region of the frame which is not based upon the content of the frame itself, but rather 
the position of the viewer's line of sight. 

Claims 22-30 depend from claim 21, either directly or indirectly, and are 
patentable for the same reasons asserted for claim 21. 

The Examiner rejected claims 22, 23, 27-29, 31, and 33 under 35 U.S.C. Section 
103 over Ruoff, Jr. and Stelmach et al., and further in view of Foster (article entitled 
"Understanding MPEG-2") and Ding et al. (article entitled "Rate Control of MPEG Video 
Coding and Recording by Rate-Quantization Modeling"). 

Ding et al. teach a system for achieving a target bit rate with consistent "visual 
quality". As suggested by Ding et al. to achieve consistent visual quality, a reasonable 
alternative is to select identical reference quantization parameters so as to distribute bits within 
a picture for uniform quality. (Ding et al. IIB). The Examiner notes that Ding et al. suggest the 
allocation of "bits to each picture according to image activities", which refers to the total 
number of bits to allocate to any particular frame of a video and not the bits to allocate to any 
particular portion of a particular image, as claimed in claim 22. 

The Examiner further notes that Ding et al. suggest quantization step sizes are 
adjusted according to an allocated quantization divided by a scaling factor. However, Ding et 
al. state that this technique results in "uneven image quality either from picture to picture or 
within each picture". Claim 22, dependent from claim 21, specifically claims substantially 
uniform apparent quality of the plurality of locations of the frame to the viewer with at least 
two different quantization values. 

In section III Ding et al. teach the use of a constant quantization value for 
uniform visual quality. However, the uniform visual quality is not encoding in a manner that 
provides a substantially uniform quality to perceiving detail at eccentric visual angles, nor 
calculating sensitivity information ... of a human visual system of a viewer perceiving image 
detail at eccentric visual angles, as claimed in claim 21. 

Moreover, claims 21-23 include the limitation that the particular region of the 
frame is determined based upon the content of the frame itself, which is not taught by Ruoff, 
which discloses using an eye tracker 100 to determine the particular region of the frame, as 
previously described. 

Claims 22-30 depend from claim 21, either directly or indirectly, and are 
patentable for the reasons asserted for claim 21 in addition to the foregoing discussion as 
related to the particular limitation of claims 22-30. 

The Examiner is respectfully requested to reconsider claims 1-51 and 53-61, in 
light of the foregoing amendments and remarks, and to pass claims 1-51, 53-61 to issue. 



Respectfully submitted, 

METHOD FOR ADAPTING QUANTIZATION IN VIDEO CODING USING 
FACE DETECTION AND VISUAL ECCENTRICITY WEIGHTING 

The present application is a continuation of co-pending 
patent application, Daly et al., Serial 
No. 60/071,099, filed January 9, 1998. 

BACKGROUND OF THE INVENTION 

The present invention relates to a system for 
encoding facial regions of a video that incorporates a model 
of the human visual system to encode frames in a manner to 
provide a substantially uniform apparent quality. 

In many systems the number of bits available for 
encoding a video, consisting of a plurality of frames, is 
fixed by the bandwidth available in the system. Typically 
encoding systems use an ad hoc control technique to select 
quantization parameters that will produce a target number of 
bits for the video while simultaneously attempting to encode 
the video frames with the highest possible quality. For 
example, in digital video recording, a group of frames must 
occupy the same number of bits for an efficient 
fast-forward/fast-rewind capability. In video telephones, the 
channel rate, communication delay, and the size of the 
encoder buffer determine the number of available bits for a 
frame. 

There are numerous systems that address the 
problem of how to encode video to achieve high quality while 
controlling the number of bits used. The systems are 
usually known as rate, quantizer, or buffer control 
techniques and can be generally classified into three major 
classes. 

The first class are systems that encode each block 
of the image several times with a set of different 
quantization factors, measure the number of bits produced 
for each quantization factor, and then attempt to select a 
quantization factor for each block so that the total number 

buffer fullness, block activity, and use all these measures 
to select a quantization factor for each block of the image. 
Such techniques are popular for real-time encoding systems 
because of their low computational complexity. 
Unfortunately, such techniques are quite inaccurate and must 
be combined with additional techniques to avoid bit 
buffer overflows and underflows. 

The third class are systems that use a model to 
predict the number of bits necessary for encoding each of 
the image blocks in terms of the block's quantization factor 

that use face detection is described [...] 

Zhou, U.S. Patent No. 5,550,581, discloses a low 
bit rate audio and video communication system that 
dynamically allocates bits among the audio and video 
information based upon the perceptual significance of 
the audio and video information. For a video 
teleconferencing system Zhou suggests that the perceptual 
quality can be improved by allocating more of the video 
bits to the facial region of the person than the remainder 
of the scene. In addition, Zhou suggests that the mouth 
area, including the lips, jaw, and cheeks, should be 
allocated more video bits than the remainder of the face 
because of the motion of these portions. In order to encode 
the face and mouth areas more accurately Zhou uses a 
subroutine that incorporates manual initialization of the 
position of each speaker within a video screen. 
Unfortunately, the manual identification of the facial 
region is unacceptable for automated systems. 

Kosemura et al., U.S. Patent No. 5,187,574, 
disclose a system for automatically adjusting the field of 
view of a television door phone in order to keep the head of 
a person centered in the image frame. The detection system 
relies on detecting the top of the person's head by 
comparing corresponding pixels in successive images. The 
number of pixels are counted along a horizontal line to 
determine the location of the head. However, such a head 
detection technique is not robust. 

Sexton, U.S. Patent No. 5,086,480, discloses a 
video image processing system in which an encoder identifies 
the head of a person from a head-against-a-background scene. 
The system uses training sequences and fits a minimum 
rectangle to the candidate pixels. The underlying 
identification technique uses vector quantization. 
Unfortunately, the training sequences require the use of an 
anticipated image which will be matched to the actual image. 
Unfortunately, if the actual image in the scene does not 
sufficiently match any of the training sequences then the 
head will not be detected. 

Lambert, U.S. Patent No. 5,012,522, discloses a 
system for locating and identifying human faces in video 
scenes. A face finder module searches for facial 
characteristics, referred to as signatures, using a 
template. In particular, the signatures searched for are 
the eye and nose/mouth. Unfortunately, such a template 
based technique is not robust to occlusions, profile 
changes, and variations in the facial characteristics. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of an exemplary 
embodiment of a face detection module of the present 
invention. 

FIG. 2 [...] weightings for 
centers of the face detection module of FIG. 1. 

FIG. 3 [...] of the face 
detection module of FIG. 1. 

FIG. 4 is an example of centers of considered [...] 

FIG. 6 [...] distance on the display of a viewer's focus and the 
resulting visual angle of the [...] 

FIG. 7 illustrates an eccentricity in visual angle 
for each location as a function of the distance from the 
detected [...] 

FIG. 8 [...] eccentricity versus location 
for a series of viewing distances. 

FIG. 9 illustrates a set of visual sensitivity 
data sets for absolute sensitivity of the human visual 
system. 

FIG. 10 illustrates the visual sensitivity as a 
function of pixel location. 

FIG. 11 illustrates [...] sensitivity [...] for an elliptical object. 

FIG. 12 is an exemplary embodiment of a block 
diagram of a block-based image encoding system of the 
present invention. 

FIG. 13 illustrates a set of quantization steps 
versus block number for one row of blocks in a frame. 

FIG. 14 is an exemplary block diagram of an 
encoder including the face detection module of FIG. 1, the 
visual model of FIG. 5, and the block-based image encoding 
system of FIG. 12, of the present invention. 

FIG. 15 is an exemplary block diagram of a decoder 
of the present invention. 

where T is a pre-defined threshold [...] 
absolute value. The thresholded image is 
represented by 1's and 0's. Thereafter, morphological 
operations are performed on the thresholded image which, for 
example, count the number of non-zero pixels within each 
of a number of regions of the image. If the non-zero pixel count within 
a region is small then all the pixels within 
that region are set to zero. If the non-zero 
pixel count within a region of 
the thresholded image is large then all 
the pixels within that region are set to one. [...] enhances 
those regions of the image indicating [...]. The 
overall effect of the morphological operations is that 
scattered ungrouped non-zero pixels are set to zero and 
holes (indicated by zeros) in the grouped regions 
are set to one. The output from the [...] 
block 22 is a cleaned image 23. 

Next, [...] identifies [...] within the 
cleaned image. A face generally has an elliptical shape so 
the face detection module 10 adopts an ellipse as a model to 
represent a face within [...]. [...] upper 
(hair) and lower (chin) areas [...] outlines may 
have quite different [...] 
trade-off between model accuracy and parametric simplicity. 
Moreover, due to the fact that [...] information is 
not actually [...] to generate the [...], a small 
lack of accuracy [...] performance of the [...] process. 
[...] can be 
represented by the following [...] = 0 [...] 
to reduce the [...]. 

The top portion of [...] head has a [...] radius and 
provides a [...] indicator 
for a person [...] circles [...] array of [...] Hough 
transform [...] 
where [...] (x, y) are [...] 
difference image 23 into 
parameter space [...] 
can be identified, in contrast to [...] Eleftheriadis [...], 
which does [...] 
for a series of pixels in the face [...] 
accurately detect suitable curvatures of the face. 

The score candidate circles block 26 scores the fit of the 
cleaned difference image 23 to each respective candidate 
circle 27. 

The fit criterion used is as follows. If C is a 
candidate circle 27, then let 

$$M_C(k) = \begin{cases} 1, & k \text{ inside or on } C \\ 0, & \text{otherwise.} \end{cases}$$

A pixel k is on the circle contour, denoted C_i, if the pixel 
is inside or on the circle, and at least one of the pixels 
in its (2L + 1) x (2L + 1) neighborhood is not. A pixel k is on 
the circle border, denoted by C_e, if the pixel is outside 
the circle and at least one of the pixels in its (2L + 1) x (2L + 1) 
neighborhood is either inside or on the circle. The 
normalized average intensities I_i and I_e are defined as: 

$$I_i = \frac{1}{|C_i|} \sum_{k \in C_i} d^{th}_{i+n}(k)$$

and 

$$I_e = \frac{1}{|C_e|} \sum_{k \in C_e} d^{th}_{i+n}(k)$$

where |.| denotes cardinality. The measure of fit is then 
defined as: 

$$R = \frac{1 + I_i}{1 + I_e}$$

A large value of R indicates a good fit of the 
candidate circle 27. In contrast, a small value of R 
indicates a poor fit of the candidate circle 27. 
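
The fit criterion can be illustrated with a short numerical sketch. The Python below is a hedged illustration under stated assumptions, not the implementation of the face detection module 10: the function names, the use of NumPy, the zero padding at the image border, and the choice of returning zero for an empty pixel set are all assumptions introduced for the example. The array `d_th` stands for the thresholded difference image.

```python
import numpy as np


def circle_mask(shape, cx, cy, r):
    """Binary mask that is True for pixels inside or on the circle."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    return (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2


def any_in_neighborhood(mask, L):
    """True where at least one True pixel of `mask` lies in the
    (2L+1) x (2L+1) window centered on the pixel."""
    padded = np.pad(mask, L, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * L + 1):
        for dx in range(2 * L + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def fit_measure(d_th, cx, cy, r, L=1):
    """Measure of fit R = (1 + I_i) / (1 + I_e) for one candidate circle."""
    inside = circle_mask(d_th.shape, cx, cy, r)
    outside = ~inside
    contour = inside & any_in_neighborhood(outside, L)  # set C_i
    border = outside & any_in_neighborhood(inside, L)   # set C_e
    # Normalized average intensities; 0.0 for an empty set is an assumption.
    i_i = float(d_th[contour].mean()) if contour.any() else 0.0
    i_e = float(d_th[border].mean()) if border.any() else 0.0
    return (1.0 + i_i) / (1.0 + i_e)
```

With a binary `d_th`, R ranges from 0.5 (all data just outside the circle) to 2.0 (all data on the inner contour), so ranking candidate circles by R favors circles whose inner contour coincides with the data.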
[...] reasonable estimation [...] 
respective candidate circle 27, the present inventors came 
[...] teleconferencing devices have [...] 
determine the appropriateness of [...] 
of the face is usually [...]. 
Candidate circles that are excessively small or excessively 
large are not suitable [...]. In addition to the fit 
data, the score candidate circles block 26 also examines the 
size and location of the circle. [...] 
outer border [...] of the display 38 is an unsuitable 
location for a center [...], the central 
upper third region 42 of the display 38 is a [...]. For 
example, the unacceptable region may have a 
weighting factor of 0.25, region 44 a 
weighting factor of 0.5 and the 
third region 42 a weighting of [...]. 

[...] the radii of candidate circles [...] 
distribution of suitability. A candidate circle [...] 
radius less than the small radii 50 is undesirable and may 
be given a weighting of [...]. A candidate circle 27 with a 
radius between the small radii 50 and an intermediate radii 
52 is desirable and [...]. The 
remainder of the possible large radii 53 
are undesirable and may be given [...]. 
Other suitable weighting factors may be used. 

The three parameters used to determine the 
suitability of a candidate circle are the fit of the 
candidate circle to the cleaned image data, the [...] 
the candidate circle, [...] 
circle's radii. Any suitable [...] parameters 
may be used, such as (0.5)*fit + [...] + (.25)*size. The 
candidate circles with the highest score are subsequently 
used as potential locations of the face [...] 
matching with candidate ellipses. [...] determination 
[...] a generate candidate ellipses block 28 generates 
[...] 
ellipses with a center [...] region around the center of 
the suitable candidate circle [...] 
[...] general range of that of the respective candidate circle are 
considered. Referring to [...], the 
ellipses include a set [...] in 
the horizontal direction and the [...] direction about 
the center 33 of the [...]. [...] 
of candidate ellipse centers in the vertical direction 
is greater than the range of [...] 
the horizontal direction, [...] because of the increased 
variability in the vertical direction [...] faces tend 
to have an elliptical [...] so 
increased variability in the [...] direction of the 
center of the candidate ellipse permits a better fit to the 
actual face. [...] so less variability is 
[...] in the horizontal direction [...] 
necessary. In other words, [...] confidence in the 
second candidate circle [...] 
centers in the horizontal direction than the vertical 
[...] 
computational [...] system. 

Preferably, a set of candidate ellipses for each 
circle center are considered within a region 
around the circle center 33 and [...] less 
than and greater than [...]. 

The candidate ellipses [...] 
candidate ellipses block 28 [...] score 
candidate ellipses block 32 [...] 
ellipses 29 using the same [...] score 
candidate circles block 26, except an elliptical mask 
is used instead of a circular one. The score candidate 
ellipses block 32 [...] use the center location 
of the ellipse and its radii [...] additional parameters, if 
desired. 

The candidate ellipse 39 with the highest score is 
then output 41 by an output top [...] 
remainder of the system. The parameters provided by the [...] 

[...] 
center y    vertical location of ellipse center 
radius x    x axis radius of ellipse 
radius y    y axis radius of ellipse 
angle θ     tilt. 

The tilt parameter [...] 

An alternative is to use circles throughout the 
[...] 10 and remainder of the system as 
[...] to a face and output the [...] 

[...] 
center y    vertical location of circle center 
radius of circle. 

[...] 
such as a diameter which is [...] 
To track the face between successive frames, 
another two frames of the video are obtained and cleaned by 
the face detection module 10. [...] determination 
block [...] been 
determined and [...] cleaned image 23 from the [...] 
difference image [...] 
block 30. The [...] block 30 determines 
[...] output top candidate block 34. 

The set of potential [...] 
substantially equal range [...] horizontal 
and vertical direction [...] significant reason to 
include the variability of the [...] candidate ellipses 
block 28 used for the initial face [...] 
location of the face is already determined and most likely 
has not moved much. Subsequent tracking involves following 
the motion of the head itself where motion is just as likely 
[...] the vertical and the horizontal directions as 
opposed to a determination of where the [...] 
to the previous top candidate ellipse 41 because it is 
unlikely that the face has become substantially larger or 
substantially [...] difference 
[...] significantly temporally different. 

[...] in variability between the horizontal and vertical 
directions reduces the computational requirements that would 
[...] candidate ellipses [...] 
ellipses block 30 [...] 
block 32, as previously described. [...] 
block 34 outputs the candidate ellipse with the highest 
[...] initial determination 
[...] face detection [...] may be extended to 
detect multiple faces. In such a case the output top 
candidate block 34 would output a set of parameters for each 
[...] 

Alternative face detection techniques may be used 
to determine the location of the face within an image. In 
such a case the output of the face detection module is 
representative of the location of the face and its [...] 
within [...]. [...] a gaze detection module which detects 
the actual viewer's eye [...] may be used to determine 
the location of the region [...] within 
a video. This may or may not be a face. 

The present inventors came to the realization that 
the human eye has a sensitivity to image detail that is 
dependent on the distance to the particular pixels of the 
image and the visual angle to the particular pixels of the 
image. Referring to FIG. 5, the system includes a 
non-linear visual model 60 of the human eye to determine 
appropriate weighting for each of the pixels or regions of 
the image. 

The visual model 60 calculates the sensitivity of 
the human eye versus the location within the image. 
Referring to FIG. 6, the visual model 60 initially 
determines the relationship between a distance 62 on the 
display 38 of the viewer's focus 64 and the resulting visual 
angle 66 of the viewer 68 to the end of the distance 62. 
The visual angle 66 will depend on the anticipated viewing 
distance of the viewer. The angular relationship is 
preferably specified in multiples of image heights or pixel 
heights, as opposed to absolute distances. The angular 
relationship is also preferably set for the particular 
system based upon the expected viewing distance and 
particular display 38. Alternatively, the angular 
relationship could be determined by a sensor determining the 
viewing distance together with information regarding the 
particular display 38. 

Referring to FIGS. 5 and 7, the visual model 60 
calculates at block 62 an eccentricity in visual angle for 
each pixel, location, or region as a function of the 
distance 63 from the detected region boundary 65 of the face 
from the output 41 of the output top candidate block 34. 
The pixel distance from the region boundary is: 

$$\theta_E = \frac{180}{\pi}\,\tan^{-1}\left(\ldots\right)$$

where θ_E is the eccentricity in units of visual angle, y is 
the vertical pixel position in the image frame, and x is the 
horizontal pixel position. The following four parameters 
are the outputs from the face detection module 10: y_c, x_c, 
y_r and x_r, where x_c and y_c are the (x,y) center positions of 
the selected ellipse in the frame, and x_r and y_r are the 
elliptical radii in the horizontal and vertical directions, 
respectively (i.e., the horizontal and vertical minor and 
major axes are 2x_r and 2y_r, respectively). V is the viewing 
distance in the units of pixel distances (e.g., in viewing 
an image with a height of 512 lines of pixels with a viewing 
distance of 2 picture heights, V=2*512=1024). 

Referring to FIG. 8, a graph of the eccentricity 
(in visual angle) for a single pixel location for a series 
of viewing distances, from 1 image height to 6 image 
heights, is shown. The viewing location is the center of a 
640 by 480 pixel display. For example, a viewer at a 
distance of 6 image heights 70 observes 6 degrees 
eccentricity in comparison to a larger 35 degrees of 
eccentricity at a distance of 1 image height 72 when looking 
at the edge of the display 38. It is noted that x_r and y_r 
are both zero in FIG. 8. 
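
The two values quoted for FIG. 8 can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions that are not stated in the text: that with x_r and y_r both zero the eccentricity reduces to the arctangent of the pixel distance from the viewing location divided by V, and that "the edge of the display" means the left or right edge, 320 pixels from the center of the 640 by 480 display. The function name is hypothetical.

```python
import math


def eccentricity_deg(x, y, x_c, y_c, v_pixels):
    """Eccentricity in degrees of visual angle for pixel (x, y) when the
    viewer fixates the point (x_c, y_c) from v_pixels away.

    Assumes zero elliptical radii, i.e. a single fixation point.
    """
    distance = math.hypot(x - x_c, y - y_c)
    return math.degrees(math.atan(distance / v_pixels))


if __name__ == "__main__":
    height = 480  # lines of pixels in the display
    for heights in (1, 6):
        v = heights * height
        theta = eccentricity_deg(0, 240, 320, 240, v)
        print(f"{heights} image height(s): {theta:.1f} degrees")
```

Under these assumptions the result is about 33.7 degrees at 1 image height and about 6.3 degrees at 6 image heights, in line with the approximate values of 35 and 6 degrees given above.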

The visual angle of the viewer to each pixel of 
the image is then used as a basis of calculating, at block 
63 the viewer's sensitivity to each pixel or block based on 
a non-linear model of the human visual system. Referring to 
FIG. 9 and the eccentricity calculation of FIG. 8, a set of 
measured data sets 80 and 82 (actual data) for absolute 
sensitivity of the human visual system is obtained across 
all frequencies. The data sets 80 and 82 are used to 
determine the maximum sensitivity to the frequency response 
of the human visual system. A Cortical Magnification 
Function (CMF) (shown below) fits the data well and provides 
data set 84, which is a function of how many brain cells are 
allocated to each visual field location. In essence, FIG. 9 
illustrates a non-linear actual model of the sensitivity of 
the human visual system as a function of eccentricity. The 
sensitivity can be normalized for use in general rate 
control or an absolute value where visually lossless quality 
is needed. Applying the sensitivity data of FIG. 9 to a 
pixel image results in an image of the same size as the 
original pixel image (or original image) and 
gives the visual sensitivity as a function of pixel 
location, as shown in FIG. 10. The CMF equation governing 
data set 84 and FIG. 10 is: 

$$S = \frac{1}{1 + K_{ECC}\,\theta_E}$$

where S is the visual sensitivity, K_ECC is a constant 
(preferred value is 0.24), and θ_E is the eccentricity in 
visual angle as given in the eccentricity equation above. 
The CMF equation is referred to as the Cortical 
Magnification Function. The result is a sensitivity image, 
or map, that can be obtained at any desired resolution with 
respect to the starting image sequence. The CMF equation may 
also be applied to the image where the viewer is observing 
any arbitrary location 90, resulting in different 
sensitivity values for the pixels. In the preferred 
embodiment the location 90 is the top candidate ellipse 
[...]. FIG. [...] illustrates the resulting cross section 
of the sensitivity values for an elliptical object with a 
radius of 100, centered at position 96 (solid line), and at 
position 98 (dashed line). It is also possible to use the 
[...] weighting of [...] constant, namely [...]. [...] 
models based on the actual human visual system may likewise 
be used to associate sensitivity information with pixels, 
locations, or regions of an image. 
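A sensitivity map of the kind described for FIG. 10 can be sketched as follows. This is only an illustration under stated assumptions: the CMF is taken in the form S = 1/(1 + K_ECC·θ), the constant 0.24 is an assumed value, the eccentricity is taken as (180/π)·arctan(d/V), and the viewer is assumed to fixate a single point rather than an elliptical region. All names are hypothetical.

```python
import math

K_ECC = 0.24  # assumed value of the CMF constant

def sensitivity(ecc_deg, k_ecc=K_ECC):
    """Assumed CMF form: sensitivity falls off as 1 / (1 + k * eccentricity)."""
    return 1.0 / (1.0 + k_ecc * ecc_deg)

def sensitivity_map(width, height, x_c, y_c, viewing_distance_px):
    """Per-pixel sensitivity for a viewer fixating the point (x_c, y_c).

    Simplification (assumption): the fixated region has zero radius, so the
    distance is measured from the point itself, not from a region boundary.
    """
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            d = math.hypot(x - x_c, y - y_c)
            ecc = math.degrees(math.atan(d / viewing_distance_px))
            row.append(sensitivity(ecc))
        rows.append(row)
    return rows

# Small toy frame viewed from 2 image heights.
smap = sensitivity_map(64, 48, x_c=32, y_c=24, viewing_distance_px=2 * 48)
```

The map has the same dimensions as the input frame, equals 1 at the fixation point, and decreases monotonically with distance from it.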
The visual model 60 produces sensitivity 
information as a function of the location within the image 
in relation to a model of the human visual system. The 
values preferably range from 0 to 1, where 1 is the most 
sensitive. Referring again to FIG. 1, [...] the 
sensitivity associated with each region, block, or pixel of 
the image [...]. The video frame 14 needs to be encoded by an 
encoder 100 and then stored or transmitted with a 
pre-selected target number of bits, suitable for the 
particular system. The following description is based on a 
typical [...] understood [...]. 

Referring to FIG. [...], [...] 16x16 pixels per block. The 
pixel values of each block are transformed by a block 
transform 102 into a set of coefficients, preferably by 
using a Discrete Cosine Transform (DCT). The resulting 
coefficients are quantized by a block quantizer 104 and 
[...] 106. The quantization of the transformed coefficients 
determines the quality of the encoding of each image block. 
The quantization of the ith image block [...] referred to as 
the [...] the ith block, and [...] size used for quantizing 
the [...]. In the MPEG-1 and the MPEG-2 [...] the 
quantization scale and the jth coefficient of a block is 
quantized using a quantizer of step size Q_i [...] the jth 
value of a quantization [...] designer of the MPEG codec. 

[...] The number of bits produced when encoding the ith 
image block, B_i, is a function of the value of the [...] 
and the statistics of the block. Finer quantization 
produces a large number of bits; coarser quantization 
(large Q_i) produces a smaller number of bits but also 
lower image quality. [...] intracoded [...] the block in a 
frame [...] predicting the current block from previously 
encoded blocks, and only the difference or prediction error 
is encoded. Such predicted blocks are said to be 
intercoded, or of the class [...]. The techniques described 
herein are suitable for intra, inter, or both intra and 
inter block encoding techniques. 

Referring to FIG. 13, the quantization steps (Q) versus 
block number for [...] a frame is shown. There are three 
different video coding strategies discussed below. Each 
technique is first briefly discussed and then the latter 
two are discussed in greater detail. 

FIRST VIDEO CODING STRATEGY 
The first strategy is represented by the line 120, which 
uses the same quantization step Q for all the blocks in the 
row. This is referred to as the fixed-Q method. The 
resulting number of bits necessary to encode the row of 
blocks is referred to as B. 
SECOND VIDEO CODING STRATEGY 
The second strategy is represented by the 
staircased line 122. Q_j is set to Q for the block closest 
to the location where the system has determined that the 
viewer is observing, such as the face region. In FIG. 13, 
the viewing location is shown as the middle of the row. 
The Q_j's are selected to be larger than Q for blocks farther 
from the center. Since all the quantization steps are as 
large as or larger than those for the fixed-Q strategy 120, 
the staircased line 122 technique will encode the blocks in 
the row with fewer bits. The resulting number of bits 
necessary to encode the row of blocks using the staircased 
line 122 technique is referred to as B', where B'<B. With 
the proper selection of Q_j for each block the image quality 
will appear uniform to the human eye, as described in detail 
below. Accordingly, the perceived quality of the encoded 
images using line 120 or line 122 will be the same, but, as 
mentioned above, using line 122 will produce fewer bits. 

THIRD VIDEO CODING STRATEGY 
If the quantization steps of line 122 are reduced 
by a constant, the number of bits necessary to encode the 
blocks will be greater than B'. The staircased line 124 
represents the steps Q_j' used for encoding the blocks 
resulting in the same number of bits B as the line 120. The 
blocks of the entire row will be perceived by the viewer as 
having the same image quality, with the proper selection of 
the Q_j' values. The center is quantized with a step size of 
Q'<Q, resulting in the image quality at the center having a 
better quality than the fixed-Q technique. Hence, the 
perceived image quality to the viewer of the entire row, 
which is substantially uniform, will be higher than the 
fixed-Q case, even though both techniques use the same 
number of bits B. The objective of the staircased line 124 
is to find the proper Q_j' values automatically, so that the 
pre-selected target number of bits (in this case B) is 
achieved. 

DETAILS OF SECOND STRATEGY 
The present inventor came to the realization that 
a coarser quantization on image blocks to which the viewer 
is less sensitive can be performed without affecting the 
perceived image quality. In fact, when encoding digital 
video, the quantization factor can be increased according to 
the sensitivities of the human visual system and thereby 
decrease the number of bits necessary for each frame. In 
particular, if the entire N blocks of the image are 
quantized and encoded with quantization steps: 

$$Q/S_1,\; Q/S_2,\; \ldots,\; Q/S_N \qquad \text{[Equation 1]}$$

respectively, where S_k is the sensitivity associated to the 
kth block, the perceived quality of the encoded frame will 
be the same as if all the blocks were quantized with step 
size Q. Since the sensitivities are smaller than or equal to 
1, the resulting quantizers in Equation 1 will be as large 
as or larger than Q, and therefore will produce fewer bits 
when encoding a given frame. 

To summarize, the result of such an encoding 
scheme, where the sensitivities are representative of the 
perceived image quality based on a model of the human 
visual system and the quantization factor is varied with 
respect to the sensitivity information, is an image that 
has a perceived uniform quality. This also provides a 
minimum bit rate with the uniform quality. 

The following steps may be used to reduce the 
number of bits for a video frame using a preselected base 
quantization step size Q. 

STEP 1. Initially set k equal to 1. 

STEP 2. Find the maximum value of the sensitivity 
for the pixels in the kth block, S_k, 

$$S_k = \max\left(S_{k,1}, S_{k,2}, S_{k,3}, \ldots\right) \qquad \text{[Equation 2]}$$

where S_{k,i} is the sensitivity for the ith pixel in the kth 
block. Alternatively, the maximum operation could be 
replaced by any other suitable evaluation of the 
sensitivities of a block, such as the average of the 
sensitivities. 

STEP 3. Encode the kth block with a quantizer of 
step size Q/S_k. 

STEP 4. If k<N, then let k=k+1 and go to Step 2. 
Otherwise stop. 
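The four steps above can be sketched as a short loop. This is an illustrative sketch only: the per-pixel sensitivities are assumed to be already available for each block, the actual encoding of a block is omitted, and the function names are hypothetical.

```python
def block_sensitivity(pixel_sensitivities, reduce=max):
    """Step 2: collapse per-pixel sensitivities into one value S_k.

    `reduce` defaults to max; another evaluation such as an average
    could be passed in instead.
    """
    return reduce(pixel_sensitivities)

def quantizer_steps(base_q, blocks, reduce=max):
    """Steps 1 to 4: return the step size Q / S_k for every block in order."""
    steps = []
    for pixel_sens in blocks:  # k = 1 .. N
        s_k = block_sensitivity(pixel_sens, reduce)
        steps.append(base_q / s_k)  # Step 3 would encode with this step
    return steps

# Toy row of four blocks, two pixels each; the third block is being viewed.
blocks = [[0.2, 0.25], [0.5, 0.4], [1.0, 0.8], [0.5, 0.45]]
steps = quantizer_steps(8.0, blocks)
print(steps)
```

Because every sensitivity is at most 1, each resulting step is at least as large as the base step Q, which matches the staircase shape of line 122.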

DETAILS OF THIRD STRATEGY 
In many systems the total number of bits available 
for encoding a video frame is often set in advance by the 
user or the communication channel. Consequently, some rate 
or quantizer control strategy is necessary for selecting the 
value of the quantization steps so that the frame target is 
achieved as suggested by line 124 of FIG. 13. In other 
words, selecting the number of bits results in the 
aforementioned base Q likely not matching the available 
bandwidth. 

A model for the number of bits invested in the ith 
image block is: 

$$B_i = A\left(K\,\frac{\sigma_i^2}{Q_i^2} + C\right) \qquad \text{[Equation 3]}$$

where Q_i is the quantizer step size or quantization scale, A 
is the number of pixels in a block (e.g., in MPEG and H.263 
A=16^2 pixels), K and C are model parameters (described 
below), and σ_i is the empirical standard deviation of the 
pixels in the block, and is defined as: 

$$\sigma_i = \sqrt{\frac{1}{A}\sum_{j=1}^{A}\left(P_i(j) - \bar{P}_i\right)^2} \qquad \text{[Equation 4]}$$

with P_i(j) the value of the jth pixel in the ith block and 
P̄_i the average of the pixel values in the block. P̄_i is 
defined as, 

$$\bar{P}_i = \frac{1}{A}\sum_{j=1}^{A} P_i(j) \qquad \text{[Equation 5]}$$



For a color image, the P_i(j)'s are the values of the 
luminance and chrominance components for the block pixels. 
The model of Equation 3 was derived using a rate-distortion 
analysis of the block's encoder and is discussed in greater 
detail in co-pending United States Patent Application Serial 
No. 09/008,137, filed January 16, 1998, incorporated by 

reference herein. 
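As a small numerical illustration of the block model, the sketch below evaluates the standard deviation and the predicted bit count for a toy block. It assumes the model has the form B_i = A(K·σ_i²/Q_i² + C) with σ_i the population standard deviation of the block pixels; the values chosen for K, C, and Q are arbitrary, and the function names are hypothetical.

```python
import math

def block_std(pixels):
    """Empirical (population) standard deviation of the pixel values."""
    a = len(pixels)
    mean = sum(pixels) / a
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / a)

def model_bits(pixels, q, k, c):
    """Predicted bits for one block under the assumed model A*(K*s^2/Q^2 + C)."""
    a = len(pixels)
    sigma = block_std(pixels)
    return a * (k * sigma ** 2 / q ** 2 + c)

block = [10, 20, 30, 40]  # toy 4-pixel block instead of 16x16
sigma = block_std(block)
bits = model_bits(block, q=5.0, k=0.5, c=0.1)
print(f"sigma = {sigma:.3f}, predicted bits = {bits:.2f}")
```

Doubling Q in this model divides the first term by four, which is why coarser quantization of low-sensitivity blocks saves bits quickly.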

K and C are model parameters. K depends on the 
encoder efficiency and the distribution of the pixel values, 
and C is the number of bits for encoding overhead 
information (e.g., motion vectors, syntax elements, etc.). 
Preferably, the values of K and C are not known in advance 
and are estimated during encoding. 

The objective of the third technique is to find 
the value of the quantization steps that satisfy the 
following two conditions: 

(1) the total number of bits produced for the 
image is a pre-selected target B; and 

(2) the overall image quality is perceived as 
homogeneous, constant, or uniform. 

Let N be the number of blocks in the video frame. The first 
condition in terms of the encoder model is: 

$$B = \sum_{i=1}^{N} B_i = \sum_{i=1}^{N} A\left(K\,\frac{\sigma_i^2}{Q_i^2} + C\right) \qquad \text{[Equation 6]}$$

As described in relation to the second strategy, the second 
condition is satisfied by a set of quantizers, 

$$Q'/S_1,\; Q'/S_2,\; \ldots,\; Q'/S_N \qquad \text{[Equation 7]}$$

where Q'/S_k = Q_k is the quantization step of the kth block, 
but now Q' is not known. 

Combining Equations 6 and 7 the following equation is 
obtained: 

$$B = \sum_{k=1}^{N} B_k = \sum_{k=1}^{N} A\left(K\,\frac{\sigma_k^2 S_k^2}{Q'^2} + C\right) \qquad \text{[Equation 8]}$$

The following expression for Q' is obtained from Equation 8: 

$$Q' = \sqrt{\frac{AK}{B - ANC}\sum_{k=1}^{N}\sigma_k^2 S_k^2} \qquad \text{[Equation 9]}$$

Equation 9 is the basis for the preferred rate control 
technique, described below. 
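The closed-form solution for Q' can be checked numerically. The sketch below assumes the block model B_k = A(K·σ_k²/Q_k²+ C) and the quantizer set Q_k = Q'/S_k, so that Q' = sqrt(A·K·Σσ_k²S_k² / (B − A·N·C)). It then verifies that the model bit counts sum back to the target. The numeric values and the function name are illustrative assumptions.

```python
import math

def base_quantizer(target_bits, a, k, c, sigmas, sens):
    """Solve the frame-level bit constraint for the unknown base step Q'."""
    n = len(sigmas)
    energy = sum((sg * s) ** 2 for sg, s in zip(sigmas, sens))
    budget = target_bits - a * n * c  # bits left after overhead
    if budget <= 0:
        raise ValueError("target does not cover the overhead bits")
    return math.sqrt(a * k * energy / budget)

A, K, C = 256, 0.5, 0.1          # assumed model parameters
sigmas = [12.0, 20.0, 8.0]       # per-block standard deviations
sens = [0.5, 1.0, 0.5]           # per-block sensitivities, at most 1
B = 2000.0                       # target bits for the frame

q_prime = base_quantizer(B, A, K, C, sigmas, sens)
steps = [q_prime / s for s in sens]

# Bits predicted by the model when each block uses its own step.
total = sum(A * (K * sg ** 2 / q ** 2 + C) for sg, q in zip(sigmas, steps))
print(f"Q' = {q_prime:.3f}, model total = {total:.1f}")
```

The block with sensitivity 1 is quantized with Q' itself, and the less sensitive blocks receive proportionally larger steps.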

The quantizers for encoding the N image blocks in 
a frame are preferably selected with the following 
technique. 

STEP 1. Initialization. 
Let B_1 = B (available bits) and N_1 = N (number of blocks). 
Let 

$$E_1 = \sum_{k=1}^{N}\sigma_k^2 S_k^2 \qquad \text{[Equation 10]}$$

where σ_k and S_k are defined in Equations 4 and 2, 
respectively. If the values of the parameters K and C of 
the encoder model are known or estimated in advance, e.g., 
using linear regression, let K_1 = K and C_1 = C. If the 
parameters are not known, set K_1 and C_1 to some small non- 
negative values, such as K_1 = 0.5 and C_1 = 0 as initial 
estimates. In video coding, one could set K_1 and C_1 to the 
values K_{N+1} and C_{N+1}, respectively, from the previous 
encoded frame, or any other suitable value. 

STEP 2. The quantization parameter for the ith 
block is computed as follows: 

$$Q_i = \frac{1}{S_i}\sqrt{\frac{A K_i E_i}{B_i - A N_i C_i}} \qquad \text{[Equation 11]}$$

If the values of the Q-parameters are restricted to a fixed 
set (e.g., in H.263, Q_i = 2QP and QP takes values in 
1, 2, 3, ..., 31), round Q_i to the nearest value in the set. 
The square root operation can be implemented using look-up 
tables. 

STEP 3. The ith block is encoded with a block-[...]. 
[...] B_{i+1} = B_i - B'_i, [...]. 

STEP 4. The parameters K_{i+1} and C_{i+1} of the encoder 
model are updated. For the [...] mode, [...]. For the 
adaptive mode, K_{i+1} and C_{i+1} are determined using any 
[...] fitting technique. For example, one could use the 
model fitting technique in co-pending United States Patent 
Application Serial No. 09/008,137, incorporated by 
reference herein. 

STEP 5. If [...] blocks are [...] go to Step 2. 
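The block-by-block procedure can be sketched as a loop. Several details are assumptions made for illustration, because they are not legible in the text: the per-block step is taken as Q_i = sqrt(A·K_i·E_i / (B_i − A·N_i·C_i)) / S_i, the remaining energy E and block count N are decremented after each block, K and C are held fixed rather than re-estimated, and the step is clamped to the range 2 to 62 instead of being rounded to an allowed set. The callback `encode_block` is a hypothetical stand-in for a real block encoder that returns the bits it used.

```python
import math

def rate_control(target_bits, a, sigmas, sens, encode_block,
                 k=0.5, c=0.0, q_min=2.0, q_max=62.0):
    """Pick one quantizer step per block so the frame lands near target_bits."""
    bits_left = float(target_bits)          # B_i
    blocks_left = len(sigmas)               # N_i
    energy = sum((sg * s) ** 2 for sg, s in zip(sigmas, sens))  # E_i
    steps = []
    for i, (sg, s) in enumerate(zip(sigmas, sens)):
        budget = bits_left - a * blocks_left * c
        if budget > 0 and energy > 0:
            q = math.sqrt(a * k * energy / budget) / s
        else:
            q = q_max                       # out of bits: coarsest step
        q = min(max(q, q_min), q_max)       # assumed clamp, not rounding
        used = encode_block(i, q)           # bits actually spent on block i
        steps.append(q)
        bits_left -= used                   # B_{i+1} = B_i - bits used
        energy -= (sg * s) ** 2             # assumed update of E
        blocks_left -= 1                    # assumed update of N
        # Fixed mode (assumption): k and c are left unchanged here.
    return steps, bits_left

A, K, C = 256, 0.5, 0.1
sigmas = [12.0, 20.0, 8.0]
sens = [0.5, 1.0, 0.5]

def toy_encoder(i, q):
    """Stand-in encoder that spends exactly what the assumed model predicts."""
    return A * (K * sigmas[i] ** 2 / q ** 2 + C)

steps, leftover = rate_control(2000.0, A, sigmas, sens, toy_encoder, k=K, c=C)
print(steps, leftover)
```

With an encoder that follows the model exactly, the loop spends the whole budget and keeps the same base step throughout; with a real encoder the per-block recomputation is what lets the loop correct for blocks that cost more or fewer bits than predicted.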